Abstract

Content-based  image  retrieval  is  an  important  area  of  research.  Here,  a  method  to  characterize  visual 
appearance  for  determining  global  similarity  in  images  is  described.  Images  are  filtered  with  Gaussian 
derivatives  and  geometric  features  are  computed  from  the  filtered  images.  The  geometric  features  used 
here  are  curvature  and  phase.  Two  images  may  be  said  to  be  similar  if  they  have  similar  distributions  of 
such  features.  Global  similarity  may,  therefore,  be  deduced  by  comparing  histograms  of  these  features. 

This  allows  for  rapid  retrieval.  The  system’s  performance  on  a  database  of  about  1500  grey-level  images 
and  another  database  of  2000  trademark  images  is  shown.  It  is  also  shown  that  the  approach  is  scalable 
and  examples  of  query  results  on  a  database  of  more  than  63000  trademark  images  are  provided. 

1  Introduction 

The  libraries  of  the  future  will  be  increasingly  digital  and  will  need  good  tools  to  navigate,  search  and 
retrieve relevant information. While there exist good search engines for ASCII text, good ways of retrieving
images are still not available. A traditional approach to indexing images is to manually create textual
annotations or keywords for each image. These annotations may later be used to retrieve images. Manual
annotation is slow, labor intensive and expensive. In addition, keywords cannot always capture the expressiveness
of an image or anticipate the different purposes an image may be used for. There is thus a need for
retrieving  images  using  their  content. 

The  indexing  and  retrieval  of  images  using  their  content  is  a  difficult  problem.  A  person  using  an  image 
retrieval  system  usually  seeks  to  find  semantically  relevant  information.  For  example,  a  person  may  be 
looking  for  a  picture  of  a  leopard  from  a  certain  viewpoint.  Or  alternatively,  the  user  may  require  a  picture 
of  Abraham  Lincoln  from  a  particular  viewpoint.  Since  the  automatic  segmentation  of  an  image  into  objects 
is  a  difficult  and  unsolved  problem  in  computer  vision,  inferring  semantic  information  from  image  content  is 
difficult to do. However, many image attributes like color, texture, shape and “appearance” are often directly
correlated  with  the  semantics  of  the  problem.  For  example,  logos  or  product  packages  (e.g.,  a  box  of  Tide) 
have  the  same  color  wherever  they  are  found.  The  coat  of  a  leopard  has  a  unique  texture  while  Abraham 
Lincoln’s  appearance  is  uniquely  defined.  These  image  attributes  can  often  be  used  to  index  and  retrieve 
images. 

*This  material  is  based  on  work  supported  in  part  by  the  National  Science  Foundation,  Library  of  Congress  and  Department  of 
Commerce  under  cooperative  agreement  number  EEC-9209623,  in  part  by  the  United  States  Patent  and  Trademarks  Office  and  the 
Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235,
in part by the National Science Foundation under grant IRI-9619117 and in part by NSF Multimedia CDA-9502639.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily
reflect  those  of  the  sponsors. 
In this paper, visual appearance is used to retrieve images from a database which are similar to a query
image.  The  visual  appearance  of  an  image  is  characterized  here  using  the  shape  of  the  intensity  surface.  The 
images  are  filtered  with  Gaussian  derivatives  and  geometric  features  are  computed  from  the  filtered  images. 
The  geometric  features  used  here  are  the  image  shape  index  (which  is  a  ratio  of  curvatures  of  the  three 
dimensional  intensity  surface)  and  the  local  orientation  of  the  gradient.  Two  images  are  said  to  be  similar  if 
they have similar distributions of such features. The images are, therefore, ranked by comparing histograms
of these features. The method is demonstrated with a database of about 1500 grey-level images of objects
such as cars, faces, apes and other miscellaneous objects and a second database of about 2000 trademark images. It is
then shown that the technique may be scaled to retrieve images from a database of about 63000 trademark images from
the US Patent and Trademark Office.

The  rest  of  the  paper  is  organized  as  follows.  Section  2  provides  some  background  on  the  image  retrieval 
area  as  well  as  on  the  appearance  matching  framework  used  in  this  paper.  Section  3  surveys  related  work  in 
the  literature.  In  section  4,  the  notion  of  appearance  is  developed  further  and  characterized  using  Gaussian 
derivative  filters  and  the  derived  global  representation  is  discussed.  Section  5  shows  how  the  representation 
may  be  scaled  to  retrieve  images  from  a  database  of  about  63,000  trademark  images.  A  discussion  and 
conclusion  follows  in  Section  6. 

2  Motivation  and  Background 

The  different  image  attributes  like  color,  texture,  shape  and  appearance  have  all  been  used  in  a  variety 
of systems for retrieving images similar to a query image (see Section 3 for a review). Systems like QBIC [5] and
Virage [4] allow users to combine color, texture and shape to retrieve images from a database of general images. One
weakness  of  such  a  system  is  that  attributes  like  color  do  not  have  direct  semantic  correlates  when  applied  to 
a  database  of  general  images.  For  example,  say  a  picture  of  a  red  and  green  parrot  is  used  to  retrieve  images 
based  on  their  similarity  in  color  with  it.  The  retrievals  may  include  other  parrots  and  birds  as  well  as  red 
flowers  with  green  stems  and  other  images.  While  this  is  a  reasonable  result  when  viewed  as  a  matching 
problem,  clearly  it  is  not  a  reasonable  result  for  a  retrieval  system.  The  problem  arises  because  color  does 
not  have  a  good  correlation  with  semantics  when  used  with  general  images.  However,  if  the  domain  or  set 
of  images  is  restricted  to  say  flowers,  then  color  has  a  direct  semantic  correlate  and  is  useful  for  retrieval 
(see  [2]  for  an  example). 

Some  attempts  have  been  made  to  retrieve  objects  using  their  shape  [5,  21].  For  example,  the  QBIC 
system  [5],  developed  by  IBM,  matches  binary  shapes.  It  requires  that  the  database  be  segmented  into 
objects.  Since  automatic  segmentation  is  an  unsolved  problem,  this  requires  the  user  to  manually  outline  the 
objects  in  the  database.  Clearly  this  is  not  desirable  or  practical. 

Except  for  certain  special  domains,  all  methods  based  on  shape  are  likely  to  have  the  same  problem.  An 
object’s  appearance  depends  not  only  on  its  three  dimensional  shape,  but  also  on  the  object’s  albedo,  the 
viewpoint from which it is imaged and a number of other factors. It is non-trivial to separate the different
factors constituting an object’s appearance and it is usually not possible to separate an object’s three
dimensional shape from the other factors. For example, the face of a person has a unique appearance that
cannot just be characterized by the geometric shape of the ‘component parts’. In this paper a characterization
of the shape of the intensity surface of imaged objects is used for retrieval. The experiments conducted
show that retrieved objects have similar visual appearance, and henceforth an association is made between
‘appearance’ and the shape of the intensity surface.

Similarity  can  be  computed  using  either  local  or  global  methods.  In  local  similarity,  a  part  of  the  query 
is  used  to  match  a  part  of  a  database  image  or  images.  One  approach  to  computing  local  similarity  [17]  is 
to have the user outline the salient portions of the query (e.g., the wheels of a car or the face of a person)
and match the outlined portion of the query with parts of images in the database. Although the technique
works well in extracting relevant portions of objects embedded against backgrounds, it is slow. The slow
speed stems from the fact that the system must not only answer the question “is this image similar” but also
the question “which part of the image is relevant”.

This  paper  focuses  on  a  representation  for  computing  global  similarity.  That  is,  the  task  is  to  find  images 
that, as a whole, appear visually similar. The utility of global similarity retrieval is evident, for example, in
finding  similar  scenes  or  similar  faces  in  a  face  database.  Global  similarity  also  works  well  when  the  object 
in  question  constitutes  a  significant  portion  of  the  image.  The  usefulness  of  the  representation  in  retrieving 
images  with  good  precision  is  demonstrated  using  a  database  of  about  1500  images  of  faces,  apes,  cars  and 
other  miscellaneous  objects  and  with  another  database  of  about  2000  images  of  trademarks. 

2.1  Appearance  based  retrieval 

The image intensity surface is robustly characterized using features obtained from responses to multi-scale
Gaussian derivative filters. Koenderink [13] and others [6] have argued that the local structure of an
image  can  be  represented  by  the  outputs  of  a  set  of  Gaussian  derivative  filters  applied  to  an  image.  That 
is,  images  are  filtered  with  Gaussian  derivatives  at  several  scales  and  the  resulting  response  vector  locally 
describes  the  structure  of  the  intensity  surface.  By  computing  features  derived  from  the  local  response 
vector  and  accumulating  them  over  the  image,  robust  representations  appropriate  to  querying  images  as  a 
whole  (global  similarity)  can  be  generated.  One  such  representation  uses  histograms  of  features  derived 
from  the  multi-scale  Gaussian  derivatives.  Histograms  form  a  global  representation  because  they  capture 
the distribution of local features (a histogram is one of the simplest ways of estimating a non-parametric
distribution). This global representation can be efficiently used for global similarity retrieval by appearance,
and retrieval is very fast (of the order of milliseconds per query; see Section 4.1).

The  choice  of  features  often  determines  how  well  the  image  retrieval  system  performs.  Here  the  task  is  to 
robustly  characterize  the  3-dimensional  intensity  surface.  A  3-dimensional  surface  is  uniquely  determined  if 
the  local  curvatures  everywhere  are  known.  Thus,  it  is  appropriate  that  one  of  the  features  be  local  curvature. 
The  principal  curvatures  of  the  intensity  surface  are  invariant  to  image  plane  rotations,  monotonic  intensity 
variations  and  further,  their  ratios  are  in  principle  insensitive  to  scale  variations  of  the  entire  image.  However, 
spatial  orientation  information  is  lost  when  constructing  histograms  of  curvature  (or  ratios  thereof)  alone. 
Therefore  we  augment  the  local  curvature  with  local  phase,  and  the  representation  uses  histograms  of  local 
curvature  and  phase. 

Local principal curvatures and phase are computed at several scales from responses to multi-scale Gaussian
derivative filters. Then histograms of the curvature ratios [12, 3] and phase are generated. Thus, the
image is represented by a single vector (multi-scale histograms). During run-time the user presents an example
image as a query and the query histograms are compared with the ones stored, and the images are
then ranked and displayed in order to the user.

2.2  The  choice  of  domain 

There  are  two  issues  in  building  a  content  based  image  retrieval  system.  The  first  issue  is  technological, 
that  is,  the  development  of  new  techniques  for  searching  images  based  on  their  content.  The  second  issue 
is  user  or  task  related,  in  the  sense  of  whether  the  system  satisfies  a  user  need.  While  a  number  of  content 
based retrieval systems have been built ([5, 4]), it is unclear what the purpose of such systems is and whether
people  would  actually  search  in  the  fashion  described. 

In  this  paper  we  describe  how  the  techniques  described  here  may  be  scaled  to  retrieve  images  from  a 
database  of  about  63000  trademark  images  provided  by  the  US  Patent  and  Trademark  Office.  This  database 
consists  of  all  (at  the  time  the  database  was  provided)  the  registered  trademarks  in  the  United  States  which 
consist  only  of  designs  (i.e.  there  are  no  words  in  them).  Trademark  images  are  a  good  domain  with 
which  to  test  image  retrieval.  First,  there  is  an  existing  user  need:  trademark  examiners  do  have  to  check  for 
trademark  conflicts  based  on  visual  appearance.  That  is,  at  some  stage  they  are  required  to  look  at  the  images 
and  check  whether  the  trademark  is  similar  to  an  existing  one.  Second,  trademark  images  may  consist  of 
simple  geometric  designs,  pictures  of  animals  or  even  complicated  designs.  Thus,  they  provide  a  test-bed 
for  image  retrieval  algorithms.  Third,  there  is  text  associated  with  every  trademark  and  the  associated  text 
may be used in a number of ways. One of the problems with many image retrieval systems is that it is unclear
where the example or query image will come from. In this paper, the associated text is used to provide an
example or query image. In future papers, we will explore how text and image searches may be combined
to build more sophisticated systems. Using trademark images does have some limitations. First, we are
restricted to binary images (albeit large ones). As shown later in the paper, this does not create any problems
for the algorithms described here. Second, in some cases the use of abstract images makes the task more
difficult. Others have attempted to get around it by restricting the trademark images to geometric designs
[8].

3  Related  Work 

Several  authors  have  tried  to  characterize  the  appearance  of  an  object  via  a  description  of  the  intensity 
surface. In the context of object recognition, the authors of [20] represent the appearance of an object using a parametric
eigen space description. This space is constructed by treating the image as a fixed length vector, and then
computing the principal components across the entire database. The images therefore have to be size and intensity
normalized, segmented and trained. Similarly, using principal component representations described
in [10], face recognition is performed in [25]. In [23] the traditional eigen representation is augmented by
using most discriminant features and is applied to image retrieval. The authors apply eigen representation
to retrieval of several classes of objects. The issue, however, is that these classes are manually determined
and training must be performed on each. The approach presented in this paper is different from all the above
because eigen decompositions are not used at all to characterize appearance. Further, the method presented
uses no learning and does not require constant sized images. It should be noted that although learning significantly
helps in such applications as face recognition, it may not be feasible in many instances
where sufficient examples are not available. This system is designed to be applied to a wide class of images
and there is no restriction per se.

In  earlier  work  we  showed  that  local  features  computed  using  Gaussian  derivative  filters  can  be  used  for 
local  similarity,  i.e.  to  retrieve  parts  of  images  [17].  Here  we  argue  that  global  similarity  can  be  determined 
by  computing  local  features  and  comparing  distributions  of  these  features.  This  technique  gives  good  results, 
and  is  reasonably  tolerant  to  view  variations.  Schiele  and  Crowley  [22]  used  such  a  technique  for  recognizing 
objects  using  grey-level  images.  Their  technique  used  the  outputs  of  Gaussian  derivatives  as  local  features. 
A multi-dimensional histogram of these local features is then computed. Two images are considered to be of
the same object if they have similar histograms. The difference between this approach and the one presented
by Schiele and Crowley is that here we use 1D histograms (as opposed to multi-dimensional) and further
use the principal curvatures as the primary feature.

The use of Gaussian derivative filters to represent appearance is motivated by their use in describing the
spatial structure [13] and their uniqueness in representing the scale space of a function [14, 11, 27, 24]. The
invariance properties of the principal curvatures are well documented in [6]. Nastar [19] has independently
used the image shape index to compute similarity between images. However, in his work curvatures were
computed only at a single scale. This is insufficient because the shape of the smoothed intensity surface
depends on the scale at which it is observed, so a description at just one scale is likely to give rise to
accidental mismatches (see Section 4).

In  the  context  of  global  similarity  retrieval  it  should  be  noted  that  representations  using  moment  invariants 
have  been  well  studied  [18].  In  these  methods  global  representation  of  appearance  may  involve  computing 
a few numbers over the entire image. Two images are then considered similar if these numbers are close
to each other (say using an L2 norm). We argue that such representations are not able to really capture the
“appearance” of an image, particularly in the context of trademark retrieval where moment invariants are
widely  used.  In  other  work  [17]  we  compared  moment  invariants  with  the  technique  presented  here  and 
found  that  moment  invariants  work  best  for  a  single  binary  shape  without  holes  in  it,  and,  in  general,  fare 
worse  than  the  method  presented  here.  Jain  and  Vailaya  [9]  used  edge  angles  and  invariant  moments  to 
prune  trademark  collections  and  then  use  template  matching  to  find  similarity  within  the  pruned  set.  Their 
database  was  limited  to  1100  images. 
Texture based image retrieval is also related to the appearance based work presented in this paper. Using
Wold modeling, the authors of [15] try to classify the entire Brodatz texture set and those of [7] attempt to classify
scenes, such as city and country. Of particular interest is work by [16] who use Gabor filters to retrieve
texture-similar images.

The  earliest  general  image  retrieval  systems  were  designed  by  [5,  21].  In  [5]  the  shape  queries  require 
prior  manual  segmentation  of  the  database  which  is  undesirable  and  not  practical  for  most  applications. 

4  Global  representation  of  appearance 

Three steps are involved in computing global similarity. First, local derivatives are computed at
several scales. Second, derivative responses are combined to generate local features, namely, the principal
curvatures and phase, and their histograms are generated. Third, the 1D curvature and phase histograms
generated at several scales are matched. These steps are described next.

A. Computing local derivatives: Computing derivatives using finite differences does not guarantee stability
of derivatives. In order to compute derivatives stably, the image must be regularized, or smoothed or
band-limited. A Gaussian filtered image $I_\sigma = I * G$ obtained by convolving the image $I$ with a normalized
Gaussian $G(r, \sigma)$ is a band-limited function. Its high frequency components are eliminated and derivatives
will be stable. In fact, it has been argued by Koenderink and van Doorn [13] and others [6] that the local
structure of an image $I$ at a given scale can be represented by filtering it with Gaussian derivative filters (in
the sense of a Taylor expansion), and they term it the N-jet.

However,  the  shape  of  the  smoothed  intensity  surface  depends  on  the  scale  at  which  it  is  observed.  For 
example,  at  a  small  scale  the  texture  of  an  ape’s  coat  will  be  visible.  At  a  large  enough  scale,  the  ape’s 
coat will appear homogeneous. A description at just one scale is likely to give rise to many accidental mismatches.
Thus it is desirable to provide a description of the image over a number of scales, that is, a scale
space  description  of  the  image.  It  has  been  shown  by  several  authors  [14,  11,  27,  24,  6],  that  under  certain 
general  constraints,  the  Gaussian  filter  forms  a  unique  choice  for  generating  scale-space.  Thus  local  spatial 
derivatives  are  computed  at  several  scales. 
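The derivative computation described in this step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names (`gaussian_kernels`, `separable_filter`, `local_derivatives`, `multiscale_derivatives`), the truncation of the kernel at four standard deviations, and the zero-padded borders are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernels(sigma, truncate=4.0):
    """Sampled 1D Gaussian and its first and second derivatives."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()                              # normalized Gaussian G
    g1 = -x / sigma**2 * g                    # first derivative of G
    g2 = (x**2 - sigma**2) / sigma**4 * g     # second derivative of G
    return g, g1, g2

def separable_filter(image, k_y, k_x):
    """Convolve along y (axis 0) with k_y, then along x (axis 1) with k_x.

    Borders are implicitly zero-padded, so values within one kernel radius
    of the image edge are unreliable. Each image side must be at least as
    long as the kernel.
    """
    tmp = np.apply_along_axis(np.convolve, 0, image, k_y, mode="same")
    return np.apply_along_axis(np.convolve, 1, tmp, k_x, mode="same")

def local_derivatives(image, sigma):
    """First and second Gaussian derivatives of the image at one scale."""
    image = np.asarray(image, dtype=float)
    g, g1, g2 = gaussian_kernels(sigma)
    return {
        "Ix": separable_filter(image, g, g1),
        "Iy": separable_filter(image, g1, g),
        "Ixx": separable_filter(image, g, g2),
        "Iyy": separable_filter(image, g2, g),
        "Ixy": separable_filter(image, g1, g1),
    }

def multiscale_derivatives(image, scales):
    """Derivatives at each scale, keyed by the scale value."""
    return {s: local_derivatives(image, s) for s in scales}
```

Because the Gaussian is separable, each two-dimensional derivative filter is applied as two one-dimensional convolutions, which keeps the cost per scale proportional to the kernel length rather than its square.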

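B. Feature Histograms: The normal and tangential curvatures of a 3-D surface (X, Y, Intensity) are defined
as [6]:

$$N(p, \sigma) = \left[ \frac{I_x^2 I_{yy} + I_y^2 I_{xx} - 2 I_x I_y I_{xy}}{\left(I_x^2 + I_y^2\right)^{3/2}} \right](p, \sigma)$$

$$T(p, \sigma) = \left[ \frac{\left(I_x^2 - I_y^2\right) I_{xy} + \left(I_{yy} - I_{xx}\right) I_x I_y}{\left(I_x^2 + I_y^2\right)^{3/2}} \right](p, \sigma)$$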

where $I_x(p, \sigma)$ and $I_y(p, \sigma)$ are the local derivatives of image $I$ around point $p$ using Gaussian derivatives
at scale $\sigma$. Similarly $I_{xx}(\cdot, \cdot)$, $I_{xy}(\cdot, \cdot)$, and $I_{yy}(\cdot, \cdot)$ are the corresponding second derivatives. The normal
curvature $N$ and tangential curvature $T$ are then combined [12] to generate a shape index as follows:


$$C(p, \sigma) = \operatorname{atan}\left[ \frac{N + T}{N - T} \right](p, \sigma)$$


The index value $C$ is $\pi/2$ when $N = T$ and is undefined when $N$ and $T$ are both zero, and is,
therefore, not computed there. This is interesting because very flat portions of an image (or ones with constant
ramp) are eliminated. For example in Figure 2 (middle row), the background in most of these face images
does not contribute to the curvature histogram. The curvature index or shape index is rescaled and shifted
to the range [0, 1] as is done in [3]. A histogram is then computed of the valid index values over an entire
image.

The second feature used is phase. The phase is simply defined as $P(p, \sigma) = \operatorname{atan2}(I_y(p, \sigma), I_x(p, \sigma))$.
Note that $P$ is defined only at those locations where $C$ is and ignored elsewhere. As with the curvature index,
$P$ is rescaled and shifted to lie in the interval [0, 1].
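The two histograms at a single scale might be computed along the following lines. This is a sketch under several assumptions that are not taken from the text: the curvature expressions written in the code comments are the ones assumed for $N$ and $T$, the rescaling to [0, 1] is a plain linear map of the atan and atan2 ranges, pixels with a vanishing gradient are treated as invalid, and the function name `curvature_phase_histograms`, the bin count `n_bins` and the tolerance `eps` are hypothetical.

```python
import numpy as np

def curvature_phase_histograms(Ix, Iy, Ixx, Ixy, Iyy, n_bins=32, eps=1e-8):
    """Histograms of shape index C and phase P over the valid pixels.

    Inputs are arrays of Gaussian derivative responses at one scale.
    """
    grad2 = Ix**2 + Iy**2
    has_grad = grad2 > eps
    denom = np.where(has_grad, grad2, 1.0) ** 1.5
    # Assumed form: N = (Ix^2 Iyy + Iy^2 Ixx - 2 Ix Iy Ixy) / (Ix^2 + Iy^2)^(3/2)
    N = (Ix**2 * Iyy + Iy**2 * Ixx - 2.0 * Ix * Iy * Ixy) / denom
    # Assumed form: T = ((Ix^2 - Iy^2) Ixy + (Iyy - Ixx) Ix Iy) / (Ix^2 + Iy^2)^(3/2)
    T = ((Ix**2 - Iy**2) * Ixy + (Iyy - Ixx) * Ix * Iy) / denom
    # C is undefined where N and T both vanish (flat regions, constant ramps).
    valid = has_grad & ((np.abs(N) > eps) | (np.abs(T) > eps))
    with np.errstate(divide="ignore", invalid="ignore"):
        C = np.arctan((N + T) / (N - T))      # N == T maps to +/- pi/2
    # Assumed rescaling: [-pi/2, pi/2] -> [0, 1] and [-pi, pi] -> [0, 1].
    C01 = (C[valid] + np.pi / 2) / np.pi
    P01 = (np.arctan2(Iy[valid], Ix[valid]) + np.pi) / (2 * np.pi)
    Hc, _ = np.histogram(C01, bins=n_bins, range=(0.0, 1.0))
    Hp, _ = np.histogram(P01, bins=n_bins, range=(0.0, 1.0))
    return Hc, Hp
```

Both histograms are accumulated over the same set of valid pixels, so the phase histogram inherits the suppression of flat background regions from the curvature index.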

At different scales different local structures are observed and, therefore, multi-scale histograms are
a more robust representation. Consequently, a feature vector is defined for an image $I$ as the vector
$V_i = \langle H_c(\sigma_1) \ldots H_c(\sigma_n), H_p(\sigma_1) \ldots H_p(\sigma_n) \rangle$, where $H_c$ and $H_p$ are the curvature and phase histograms
respectively. We found that using 5 scales gives good results and the scales are $1 \ldots 4$ in steps of half an
octave.
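To make the scale spacing concrete: half-octave steps from 1 to 4 give the five values 1, √2, 2, 2√2 and 4. The short sketch below lists them and assembles a feature vector by concatenation. The names `SCALES` and `feature_vector` are hypothetical, and the histograms are assumed to be supplied as plain sequences of bin counts, one per scale.

```python
# Five scales from 1 to 4 in half-octave steps: 1, sqrt(2), 2, 2*sqrt(2), 4.
SCALES = [2.0 ** (k / 2.0) for k in range(5)]

def feature_vector(curvature_hists, phase_hists):
    """Concatenate per-scale histograms into one flat vector.

    All curvature histograms come first (in scale order), followed by
    all phase histograms (in the same scale order).
    """
    vector = []
    for hist in list(curvature_hists) + list(phase_hists):
        vector.extend(float(count) for count in hist)
    return vector
```

With a fixed bin count per histogram, every image yields a vector of the same length regardless of its size, which is what allows images of different dimensions to be compared.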

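C. Matching feature histograms: Two feature vectors are compared using normalized cross-covariance
defined as

$$d_{ij} = \frac{V_i^{(m)} \cdot V_j^{(m)}}{\left\| V_i^{(m)} \right\| \left\| V_j^{(m)} \right\|}$$

where $V_i^{(m)} = V_i - \operatorname{mean}(V_i)$.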
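The matching score can be written down directly from this definition. The sketch below is illustrative only: the function name `normalized_cross_covariance` is hypothetical, and returning 0 for a constant vector (whose mean-subtracted norm is zero) is an assumption, since that case is not discussed in the text.

```python
import math

def normalized_cross_covariance(v_i, v_j):
    """Normalized cross-covariance of two equal-length feature vectors."""
    if len(v_i) != len(v_j):
        raise ValueError("feature vectors must have the same length")
    mean_i = sum(v_i) / len(v_i)
    mean_j = sum(v_j) / len(v_j)
    c_i = [a - mean_i for a in v_i]           # mean-subtracted vector for image i
    c_j = [b - mean_j for b in v_j]           # mean-subtracted vector for image j
    dot = sum(a * b for a, b in zip(c_i, c_j))
    norm = math.sqrt(sum(a * a for a in c_i)) * math.sqrt(sum(b * b for b in c_j))
    # Assumption: a constant vector carries no shape information, score 0.
    return dot / norm if norm > 0.0 else 0.0
```

The score lies in [-1, 1] and is unchanged when either vector is multiplied by a positive constant or shifted by a constant, so histograms from images with different numbers of valid pixels remain comparable.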

Retrieval is carried out as follows. A query image is selected and the query histogram vector $V_q$ is
correlated with the database histogram vectors $V_i$ using the above formula. Then the images are ranked by
their correlation score and displayed to the user. In this implementation, and for evaluation purposes, the
ranks are computed in advance, since every query image is also a database image.
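Ranking and its precomputation reduce to sorting scores. The following sketch assumes the pairwise scores are already available as a square list of lists (row q holding the scores of query q against every database image); `rank_by_score` and `precompute_ranks` are hypothetical names.

```python
def rank_by_score(scores):
    """Database indices ordered from highest to lowest correlation score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def precompute_ranks(score_matrix):
    """Ranked list for every query, given all pairwise scores.

    Row q of score_matrix holds the score of query q against each
    database image, including itself.
    """
    return [rank_by_score(row) for row in score_matrix]
```

Since each query is also a database image, it scores 1 against itself and appears first in its own ranked list, as noted for the leftmost image in the retrieval figures.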

4.1  Experiments 

The curvature-phase method is tested using two databases. The first is a trademark database of 2048
images obtained from the US Patent and Trademark Office (PTO). The images obtained from the PTO are
large and binary, and are converted to gray-level and reduced for the experiments. The second database is a collection
of 1561 assorted gray-level images. This database has digitized images of cars, steam locomotives,
diesel locomotives, apes, faces, people embedded in different background(s) and a small number of other
miscellaneous objects such as houses. These images were obtained from the Internet and the Corel photo-cd
collection and were taken with several different cameras of unknown parameters, and under varying
uncontrolled lighting and viewing geometry.

In  the  following  experiments  an  image  is  selected  and  submitted  as  a  query.  The  objective  of  this  query 
is stated and the relevant images are decided in advance. Then the retrieval instances are gauged against the
stated objective. In general, objectives of the form ‘extract images similar in appearance to the query’ will
be  posed  to  the  retrieval  algorithm.  A  measure  of  the  performance  of  the  retrieval  engine  can  be  obtained 
by  examining  the  recall/precision  table  for  several  queries.  Briefly,  recall  is  the  proportion  of  the  relevant 
material  actually  retrieved  and  precision  is  the  proportion  of  retrieved  material  that  is  relevant  [26].  It  is  a 
standard  widely  used  in  the  information  retrieval  community  and  is  one  that  is  adopted  here. 
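As a small worked illustration of these two definitions, the sketch below computes recall and precision for the top k entries of a ranked list. The function name `recall_precision` and the use of integer image identifiers are assumptions made for the example.

```python
def recall_precision(ranked_ids, relevant_ids, k):
    """Recall and precision over the top k retrieved images.

    recall    = relevant images retrieved / all relevant images
    precision = relevant images retrieved / images retrieved
    """
    relevant = set(relevant_ids)
    retrieved = ranked_ids[:k]
    if not relevant or not retrieved:
        raise ValueError("need at least one relevant and one retrieved image")
    hits = sum(1 for image_id in retrieved if image_id in relevant)
    return hits / len(relevant), hits / len(retrieved)
```

For instance, if the top four retrievals are images 3, 1, 4 and 2 and the relevant set is {1, 2, 9}, two of the three relevant images are found, giving a recall of 2/3 and a precision of 2/4.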

Queries were submitted to each of the trademark and assorted image collections for the purpose of computing
recall/precision. The judgment of relevance is qualitative. For each query in both databases the relevant
images were decided in advance. These were restricted to 48. The top 48 ranks were then examined to check
the proportion of retrieved images that were relevant. All images not retrieved within 48 were assigned a
rank equal to the size of the database. That is, they are not considered retrieved. These ranks were used to
interpolate and extrapolate precision at all recall points. In the case of assorted images relevance is easier
to determine and more similar for different people. However in the trademark case it can be quite difficult
and therefore the recall-precision can be subject to some error. The recall/precision results are summarized
in Table 1 and both databases are individually discussed below.
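One way to turn the ranks of the relevant images into precision values at the eleven standard recall points is sketched below. The text does not specify the interpolation scheme, so this example assumes the convention common in information retrieval, in which the precision at a recall level is the best precision observed at that level or any higher one. The function name `precision_at_standard_recall` and its parameters are hypothetical.

```python
def precision_at_standard_recall(relevant_ranks, db_size, cutoff=48):
    """Precision at recall 0%, 10%, ..., 100% for one query.

    relevant_ranks: 1-based ranks at which the relevant images appear.
    Relevant images ranked beyond the cutoff are treated as not retrieved
    and assigned a rank equal to the database size.
    """
    if not relevant_ranks:
        raise ValueError("at least one relevant image is required")
    ranks = sorted(r if r <= cutoff else db_size for r in relevant_ranks)
    n_rel = len(ranks)
    # After the i-th relevant image (at rank r): recall i/n_rel, precision i/r.
    points = [(i, i / r) for i, r in enumerate(ranks, start=1)]
    result = []
    for level in range(11):
        # Assumed interpolation: best precision at any recall >= level/10.
        # Integer comparison avoids floating-point issues at the boundaries.
        reachable = [p for i, p in points if 10 * i >= level * n_rel]
        result.append(max(reachable))
    return result
```

Under this convention a relevant image that falls outside the top 48 contributes a precision close to zero at the highest recall levels, which is consistent with the sharp drop at 80% to 100% recall in Table 1.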
Figure 1: Trademark retrieval using Curvature and Phase

Figure 2: Image retrieval using Curvature and Phase

Figure 1 shows the performance of the algorithm on the trademark images. Each strip depicts the top 8
retrievals, given the leftmost as the query. Most of the shapes have roughly the same structure as the query.
Note that outline and solid figures are treated similarly (see rows one and two in Figure 1). Six queries were
submitted for the purpose of computing recall-precision in Table 1.

Experiments are also carried out with assorted gray level images. Six queries submitted for recall-precision
are shown in Figure 2. The leftmost image in each row is the query and is also the first retrieved.
The rest from left to right are seven retrievals depicted in rank order. Note that flat portions of the background
are never considered because the principal curvatures are very close to zero and therefore do not
contribute to the final score. Thus, for example, the flat background in Figure 2 (second row) is not used.
Notice that visually similar images are retrieved even when there is some change in the background (row
1). This is because the dominant object contributes most to the histograms. In using a single scale poorer
results are achieved and background affects the results more significantly.

Table 1: Precision at standard recall points for six queries

| Recall (%) | 0 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Precision (trademark) % | 100 | 93.2 | 93.2 | 85.2 | 76.3 | 74.5 | 59.5 | 45.5 | 27.2 | 9.0 | 9.0 | 61.1 |
| Precision (assorted) % | 100 | 92.6 | 90.0 | 88.3 | 87.0 | 86.8 | 83.8 | 65.9 | 21.3 | 12.0 | 1.4 | 66.3 |
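The averages reported in Table 1 are the means of the eleven precision values in each row, which can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are copied from the table.

```python
# Precision (%) at recall 0, 10, ..., 100, copied from Table 1.
trademark = [100, 93.2, 93.2, 85.2, 76.3, 74.5, 59.5, 45.5, 27.2, 9.0, 9.0]
assorted = [100, 92.6, 90.0, 88.3, 87.0, 86.8, 83.8, 65.9, 21.3, 12.0, 1.4]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

average_trademark = round(mean(trademark), 1)   # 672.6 / 11
average_assorted = round(mean(assorted), 1)     # 729.1 / 11
print(average_trademark, average_assorted)      # prints 61.1 66.3
```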

The  results  of  these  examples  are  discussed  below,  with  the  precision  over  all  recall  points  depicted  in 
parentheses.  For  comparison  the  best  text  retrieval  engines  have  an  average  precision  of  50%: 


1.  Find similar cars (65%). Pictures of cars viewed from similar orientations appear in the top ranks
because of the contribution of the phase histogram. This result also shows that some background
variation can be tolerated. The eighth retrieval, although a car, is a mismatch and is not considered.

2.  Find  same  face(87.4%)  and  find  similar  faces:  In  the  face  query  the  objective  is  to  find  the  same  face. 
In experiments with a University of Bern face database of 300 faces with 10 relevant faces each, the
average  precision  over  all  recall  points  for  all  300  queries  was  78%.  It  should  be  noted  that  the  system 
presented  here  works  well  for  faces  with  the  same  representation  and  parameters  used  for  all  the  other 
databases.  There  is  no  specific  “tuning”  or  learning  involved  to  retrieve  faces.  The  query  “find  similar 
faces”  resulted  in  a  100%  precision  at  48  ranks  because  there  are  far  more  faces  than  48.  Therefore,  it 
was  not  used  in  the  final  precision  computation. 

3.  Find  dark  textured  apes  (64.2%).  The  ape  query  results  in  several  other  light  textured  apes  and  country 
scenes  with  similar  texture.  Although  these  are  not  mis-matches  they  are  not  consistent  with  the  intent 
of  the  query  which  is  to  find  dark  textured  apes. 

4. Find other patas monkeys (47.1%). Here there are 16 patas monkeys in all and 9 within a small view
variation. However, the whole image is being matched, so the number of relevant patas monkeys
is 16. The precision is low because the method cannot distinguish between light and dark textures,
leading to irrelevant images. Note that it finds other apes, dark textured ones, but those are deemed
irrelevant with respect to the query.

5. Given a wall with a Coca Cola logo, find other Coca Cola images (63.8%). This query clearly depicts
the limitation of global matching. Although all three database images that had a certain texture of the
wall (and also had Coca Cola logos) were retrieved (100% precision), two other very dissimilar images
with Coca Cola logos were not.

6. Scenes with Bill Clinton (72.8%). The retrieval in this case results in several mismatches. However,
three of the four are retrieved in succession at the top and the scenes appear visually similar.
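
The “precision over all recall points” figures quoted for each query can be illustrated with a short
sketch. It assumes the common non-interpolated definition, in which precision is measured at every rank
where a relevant image is retrieved and the values are averaged over the number of relevant images. The
exact variant used for the figures above is not stated in this section, and the function name and the
example ranking are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated average precision for one query.

    ranked_relevance: booleans in rank order, True where the retrieved
                      image is relevant to the query.
    num_relevant:     total number of relevant images in the database.
    """
    if num_relevant <= 0:
        raise ValueError("num_relevant must be positive")
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this recall point: relevant so far / rank.
            precision_sum += hits / rank
    # Relevant images that never appear in the ranking contribute zero.
    return precision_sum / num_relevant


# Hypothetical ranking: relevant images at ranks 1, 2 and 4; 4 relevant in all.
ranking = [True, True, False, True, False]
print(average_precision(ranking, num_relevant=4))  # 0.6875
```

Under this definition a relevant image that is never retrieved lowers the score, which matches the
behaviour described for the Coca Cola query.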


While the queries presented here are not “optimal” with respect to the design constraints of global
similarity retrieval, they are nonetheless realistic queries that can be posed to the system. Mismatches
can and do occur, for two reasons. The first is the case where the global appearance is very different.
The Coca Cola retrieval is a good example of this. Second, mismatches can occur at the algorithmic level.
Histograms coarsely represent spatial information and therefore will admit images with non-trivial
deformations. The recall/precision presented here compares well with text retrieval. The time per
retrieval is of the order of milliseconds. In the next section we discuss the application of the
presented technique to a database of 63000 images.

5  Trademark  Retrieval 

The system indexes 63,718 trademarks from the US Patent and Trademark Office in the design-only
category. These trademarks are binary images. In addition, the associated text consists of a design code
that designates the type of trademark, the goods and services associated with the trademark, a serial
number and a short descriptive text.

The system for browsing and retrieving trademarks is illustrated in Figure 3. The Netscape/Java user
interface has two searchable parts. On the left, a panel is included to initiate search using text. Any or
all of the fields can be used to enter a query. In this example, the text “Merriam Webster” is entered and
all images associated with it are retrieved using the Inquery [1] text search engine. The user can then
use any of the example pictures to search for images that are similar. In the specific example shown, the
second image is selected and the retrieved results are displayed on the right panel. The user can then
continue to search using any of the displayed pictures as the query.

In this section we adapt the curvature/phase histograms to retrieve visually similar trademarks. The
following steps are performed to retrieve images.

Preprocessing: Each binary image in the database is first size normalized by clipping. The images are
then converted to gray-scale and reduced in size.

Computation of Histograms: Each processed image is divided into four equal rectangular regions, and
histograms are computed for each region rather than from the pixels of the entire image. In scaling to a
large collection of images, we found that the added degree of spatial resolution significantly improves
the retrieval performance. The curvature and phase histograms are computed for each tile at three scales
(1, 4, 8). A histogram descriptor of the image is obtained by concatenating all the individual histograms
across scales and regions.

These two steps are conducted off-line.
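
The construction of the descriptor can be sketched as follows. This is a minimal illustration, not the
system's implementation: it assumes the curvature and phase maps have already been computed at each
scale, and the number of bins, the value ranges and all function names are hypothetical choices made
only for the example.

```python
import numpy as np


def tile_quadrants(feature_map):
    """Split a 2-D feature map into four equal rectangular regions.

    For odd dimensions the last row or column is dropped so that the
    four regions have the same size.
    """
    rows, cols = feature_map.shape
    r, c = rows // 2, cols // 2
    return [feature_map[:r, :c], feature_map[:r, c:2 * c],
            feature_map[r:2 * r, :c], feature_map[r:2 * r, c:2 * c]]


def normalized_histogram(values, bins, value_range):
    """Histogram of the values, scaled to sum to 1 (all zeros if empty)."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)


def image_descriptor(features_by_scale, bins=16,
                     curvature_range=(-1.0, 1.0),
                     phase_range=(-np.pi, np.pi)):
    """Concatenate per-tile curvature and phase histograms across scales.

    features_by_scale: list of (curvature_map, phase_map) pairs, one per
    scale. bins and the two ranges are assumed values for illustration.
    """
    parts = []
    for curvature_map, phase_map in features_by_scale:
        for curv_tile, phase_tile in zip(tile_quadrants(curvature_map),
                                         tile_quadrants(phase_map)):
            parts.append(normalized_histogram(curv_tile, bins, curvature_range))
            parts.append(normalized_histogram(phase_tile, bins, phase_range))
    return np.concatenate(parts)


# Synthetic feature maps standing in for three scales of one image.
rng = np.random.default_rng(0)
features = [(rng.uniform(-1, 1, (64, 64)),
             rng.uniform(-np.pi, np.pi, (64, 64))) for _ in range(3)]
descriptor = image_descriptor(features)
print(descriptor.shape)  # (384,) = 3 scales x 4 tiles x 2 features x 16 bins
```

Normalizing each histogram separately keeps every tile and scale on an equal footing in the concatenated
descriptor; whether the original system normalizes in this way is not stated here.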

Execution: The image search server begins by loading all the histograms into memory. Then it waits on a
port for a query. A CGI client transmits the query to the server. The query's histograms are matched with
those in the database. The match scores are ranked and the top N requested retrievals are returned.
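
The matching and ranking step can be sketched as below. The similarity measure used here, histogram
intersection, is a stand-in chosen for brevity and is an assumption; the match score actually used by the
system is the one defined for the curvature/phase histograms earlier in the paper. The function names
and the toy database are hypothetical.

```python
import numpy as np


def histogram_intersection(query, database):
    """Similarity of one query descriptor to every row of the database.

    query:    1-D array of length D.
    database: 2-D array of shape (M, D), one descriptor per image.
    """
    return np.minimum(database, query).sum(axis=1)


def top_matches(query, database, n):
    """Return (index, score) pairs for the n best-matching database images."""
    scores = histogram_intersection(query, database)
    # Sort by decreasing score; a stable sort keeps database order on ties.
    order = np.argsort(-scores, kind="stable")[:n]
    return [(int(i), float(scores[i])) for i in order]


# Toy in-memory database of three descriptors; the query is image 0 itself.
database = np.array([[0.5, 0.5, 0.0],
                     [0.0, 0.5, 0.5],
                     [1.0, 0.0, 0.0]])
query = database[0]
print(top_matches(query, database, n=2))  # [(0, 1.0), (1, 0.5)]
```

Because every descriptor is held in memory as one row of an array, a query reduces to a single pass over
the database followed by a sort, and the query image itself is returned first.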

5.1  Examples 

In Figure 3, the user typed in Merriam Webster in the text window. The system searches for trademarks
which have either Merriam or Webster in the associated text and displays them. Here, the first two
trademarks (first two images in the left window) belong to Merriam Webster. In this example, the user has
chosen to ‘click’ the second image and search for images of similar trademarks. This search is based
entirely on the image and the results are displayed in the right window in rank order. Retrieval takes a
few seconds and is done by comparing histograms of all 63,718 trademarks on the fly.

The original image is returned as the first result (as it should be). The images in positions 2, 3 and 5
in the second window all contain circles inside squares and this configuration is similar to that of the
query. Most of the other images are of objects contained inside a roughly square box and this is
reasonable considering that similarity is defined on the basis of the entire image rather than a part of
the image.

Figure 3: Retrieval in response to a “Merriam Webster” query

The second example is shown in Figure 4. Here the user has typed in the word Apple. The system returns
trademarks associated with the word Apple. The user queries using Apple Computer’s logo (the image in
the second row, first column of the first window). Images retrieved in response to this query are shown
in the right window. The first eight retrievals are all copies of Apple Computer’s trademark (Apple used
the same trademark for a number of other goods and so there are multiple copies of the trademark in the
database). Trademarks number 9 and 10 look remarkably similar to Apple’s trademark. They are considered
valid trademarks because they are used for goods and services in areas other than computers. Trademark
13 is another version of Apple Computer’s logo but with lines in the middle. Although somewhat visually
different, it is still retrieved in the high ranks. Image 14 is an interesting example of a mistake made
by the system. Although the image is not of an apple, it has similar distributions of curvature and
phase, as is clear from looking at it.

The system has been tried on a variety of different examples of both two-dimensional and
three-dimensional pictures of trademarks and has worked quite well. Clearly, there are issues of how
quantitative results can be obtained for such large image databases (it is not feasible for a person to
look at every image in the database to determine whether it is similar). In future work, we hope to
evolve a mechanism for quantitative testing on such large databases. It will also be important to use
more of the textual information to determine trademark conflicts.

6  Conclusions  and  Limitations 

This paper demonstrates retrieval of similar objects on the basis of their visual appearance. Visual
appearance is characterized using filter responses to Gaussian derivatives over scale space. In addition,
we claim that global representations are better constructed by representing the distribution of robustly
computed local features. In earlier experiments moment-based representations were compared with the
presented method and it was found that moment-based representations in general perform reasonably when
there are whole objects with no holes. Moments form weak descriptors and are sensitive to noise in the
image.

Currently we are investigating three issues. The first is to scale the database up to about 600000
images. The second is to incorporate user feedback or preferences of retrieved images. The third is to
combine text retrieval and image retrieval in a principled manner.

Figure 4: Retrieval in response to the query “Apple”